Journal of Clinical Epidemiology
Elsevier BV
Preprints posted in the last 90 days, ranked by how well they match Journal of Clinical Epidemiology's content profile, based on 28 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.
Jones, L.; Barnett, A.; Hartel, G.; Vagenas, D.
Background: In health research, variability in modelling decisions can lead to different conclusions even when the same data are analysed, a challenge known as inferential reproducibility. In linear regression analyses, incorrect handling of key assumptions, such as normality of the residuals and linearity, can undermine reproducibility. This study examines how violations of these assumptions influence inferential conclusions when the same data are reanalysed. Methods: We randomly sampled 95 health-related PLOS ONE papers from 2019 that reported linear regression in their methods. Data were available for 43 papers, and 20 were assessed for computational reproducibility, with three models per paper evaluated. The 14 papers that included a model at least partially computationally reproduced were then examined for inferential reproducibility. To assess the impact of assumption violations, differences in coefficients, 95% confidence intervals, and model fit were compared. Results: Of the fourteen papers assessed, only three were inferentially reproducible. The most frequently violated assumptions were normality and independence, each occurring in eight papers. Violations of independence were particularly consequential and were commonly associated with inferential failure. Although reproduced analyses often retained the same binary statistical significance classification as the original studies, confidence intervals were frequently wider, indicating greater uncertainty and reduced precision. Such uncertainty may affect the interpretation of results and, in turn, influence treatment decisions and clinical practice. Conclusion: Our findings demonstrate that substantial violations of key modelling assumptions often went undetected by authors and peer reviewers and, in many cases, were associated with inferential reproducibility failure. This highlights the need for stronger statistical education and greater transparency in modelling decisions. 
Rather than applying rigid or misinformed rules, such as incorrectly testing the normality of the outcome variable, researchers should adopt modelling frameworks guided by the research question and the study design. When assumptions are violated, appropriate alternatives, such as robust methods, bootstrapping, generalized linear models, or mixed-effects models, should be considered. Given that assumption violations were common even in relatively simple regression models, early and sustained collaboration with statisticians is critical for supporting robust, defensible, and clinically meaningful conclusions.
Jones, L. V.; Barnett, A.; Hartel, G.; Vagenas, D.
Background: Reproducibility concerns in health research have grown, as many published results fail to be independently reproduced. Achieving computational reproducibility, where others can replicate the same results using the same methods, requires transparent reporting of statistical tests, models, and software use. While data-sharing initiatives have improved accessibility, the actual usability of shared data for reproducing research findings remains underexplored. Addressing this gap is crucial for advancing open science and ensuring that shared data meaningfully support reproducibility and enable collaboration, thereby strengthening evidence-based policy and practice. Methods: A random sample of 95 PLOS ONE health research papers from 2019 reporting linear regression was assessed for data-sharing practices and computational reproducibility. Data were accessible for 43 papers. From the randomly selected sample, the first 20 papers with available data were assessed for computational reproducibility. Three regression models per paper were reanalysed. Results: Of the 95 papers, 68 reported having data available, but 25 of these lacked the data required to reproduce the linear regression models. Only eight of 20 papers we analysed were computationally reproducible. A major barrier to reproducing the analyses was the great difficulty in matching the variables described in the paper to those in the data. Papers sometimes failed to be reproduced because the methods were not adequately described, including variable adjustments and data exclusions. Conclusion: More than half (60%) of analysed studies were not computationally reproducible, raising concerns about the credibility of the reported results and highlighting the need for greater transparency and rigour in research reporting. When data are made available, authors should provide a corresponding data dictionary with variable labels that match those used in the paper. 
Analysis code, model specifications, and any supporting materials detailing the steps required to reproduce the results should be deposited in a publicly accessible repository or included as supplementary files. To increase the reproducibility of statistical results, we propose a Model Location and Specification Table (MLast), which tracks where and what analyses were performed. In conjunction with a data dictionary, MLast enables the mapping of analyses, greatly aiding computational reproducibility.
Fulbright, H. A.; Marshall, D.; Evans, C.; Corbett, M.
Objectives: To inform users about the impact of two updated study filters for limiting database search results to randomized controlled trials on Ovid MEDLINE: a sensitivity-maximizing version (SM) and a sensitivity-and-precision-maximizing version (SaPM). To provide an updated understanding of how they compare to each other. Methods: Using the final included records of 14 Cochrane reviews that had used the SM filter, we determined how many available records on Ovid MEDLINE would have been retrieved with each filter; investigated why records were missed; the unique yield; precision; and number-needed-to-read (NNR) for each filter. We also performed forwards and backwards citation searching on missed records (to determine if this could mitigate the risk of missing includes) and calculated the percentage change in the overall number-needed-to-screen (ONNS) when applying each filter to reproduction strategies. Results: On average, the SaPM filter reduced ONNS by 83% and retrieved 95.9% of includes compared with 98.2% retrieved by the SM filter. The SaPM filter offered a further 28.2% mean reduction in ONNS over the SM filter. The SM filter had a unique yield of 12 and a precision of 1.5%, versus a unique yield of three and precision of 4.4% for the SaPM filter. NNR was 68 for the SaPM filter versus 189 for the SM filter. Conclusion: The SaPM filter reduced the screening burden with minimal risk of missing eligible records (which could be mitigated by citation searching). Decisions about which filter to use should consider both the needs and resources of the review.
Das, P.; Schneider, J.; Mayo-Wilson, E.; Kilicoglu, H.; Menke, J. D.; Nam, D.; Ninan, K.; Oberste, J.-P.; Troy, A. M.; Ying, X.; Holt, A. W.; Smalheiser, N. R.
Objectives: Study design indexing of biomedical publications is crucial for evidence retrieval and synthesis. We sought to evaluate the accuracy and suitability of a transformer-based model (TM) for indexing clinical study designs, in comparison to National Library of Medicine (NLM) indexing. However, this is challenging for at least three reasons: First, to date, all automated systems have been trained and evaluated on manual NLM indexing assignments, itself subject to errors. Second, TM's probabilistic predictive scores take into account uncertainty, and can be converted to TRUE/FALSE assignments in different ways depending on the needs of users, while NLM labels are categorical. Third, our goal (to tag articles only that exhibit a given design) differs from NLM which tags articles that both discuss as well as exhibit that design. Materials and Methods: Therefore, we carried out a limited evaluation of the TM model that focuses only on the articles that received the most confident predictions, that is, the highest scores that are almost certainly TRUE and the lowest scores that are almost certainly FALSE, but which disagreed with NLM assignments. This was performed both for articles published in 2016 (when NLM decisions were manual) and in 2025 (when NLM decisions were automated). To establish ground truth, dual annotators indexed the articles independently, following written definitions, for four prominent study designs--cohort, case-control, cross-sectional, and case report. Results: For three designs (case-control, case report, cross-sectional), the articles having the top 100 predictive TM scores (when NLM failed to assign that design) were judged to exhibit that design in the great majority (86-100%) of cases. Conversely, the articles having the lowest 100 predictive TM scores (when NLM did assign the study design) exhibited the design only in relatively few (0-21%) of cases. 
The most confident predictions of the TM model were highly accurate and not redundant with automated NLM indexing; the exception was cohort study articles, for which both TM and NLM labels showed high error rates of both omission and commission. Discussion and Conclusion: TM may have value for identifying articles exhibiting study designs, which is especially important for clinical decision-making as well as systematic reviews and other evidence syntheses. NLM indexing of cohort studies cannot be regarded as a reliable gold standard for training or evaluation of automated systems, warranting efforts to create a new manually annotated corpus.
Ahnström, L.; Bruckner, T.; Aspromonti, D. A.; Caquelin, L.; Cummins, J.; DeVito, N. J.; Axfors, C.; Ioannidis, J. P. A.; Nilsonne, G.
Background: Multiple stakeholders need to locate results of registered clinical trials but frequently struggle to find them. Summary results of clinical trials are often not published in trial registries, and publications containing trial results are often not explicitly linked to their respective trial registrations. Finding these results is important to researchers, systematic reviewers, research funders, regulators, clinical practitioners, and patients. Methods: We developed TrialScout, a computer program that uses a large language model to match clinical trials registered on ClinicalTrials.gov with corresponding result publications indexed in PubMed. TrialScout's performance was evaluated through comparison to human-coded matches from previous studies of results reporting rates. Subsequently, TrialScout was applied to a random sample of 9,600 completed or terminated trials. Results: TrialScout had a sensitivity of 92.5% and a specificity of 81.2% compared to human coders. Manual review of 200 cases where TrialScout disagreed with human researchers showed that a majority (123/200, 61.5%, 95% CI, 54.4-68.3%) of disagreements were due to human errors. When used on 9,600 sampled trials in ClinicalTrials.gov, TrialScout found result publications for 6,110 (63.6%) of trials. Discussion: TrialScout reliably located results of completed clinical trials. The tool offers benefits in terms of speed and efficiency. Estimating TrialScout's accuracy is limited by the lack of a true gold standard. TrialScout can accelerate the process of locating trial results in the scientific literature and can assist in monitoring trial reporting practices.
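As an illustration of how a confidence interval for a proportion such as 123/200 can be obtained, the sketch below computes a Wilson score interval. The abstract does not say which interval method the authors used, so Wilson is an assumption made here for illustration; the result is close to, but need not exactly match, the reported 54.4-68.3%. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Counts taken from the abstract: 123 of 200 disagreements attributed to human error.
low, high = wilson_interval(123, 200)
print(f"{123 / 200:.1%} (95% CI {low:.1%} to {high:.1%})")
```

An exact (Clopper-Pearson) interval is a common alternative and is typically slightly wider than the Wilson interval.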
Fagerberg, P.; Sallander, O.; Vikhe Patil, K.; Thunborg, C.; Lundstrom, L.; Berg, A.; Nyman, A.; Borg, N.; Linden, T.
Title and abstract screening limit the timeliness of systematic reviews used for clinical guidelines. We evaluated audited large language model (LLM) triage at Sweden's National Board of Health and Welfare. Ten LLMs from five model families were tested on 419 Cochrane reviews comprising 26,892 records, and the selected ensemble was externally validated on 133 reviews including 8,501 records matched to planned guideline topics. The same locked model pair was then used prospectively across 24 systematic reviews in two national guideline programmes. On the 419-review selection benchmark, the selected Gemini-3-flash plus GPT-5.1 ensemble achieved 98.0% (95% CI, 97.3-98.7) mean review-level sensitivity, while topic-matched validation yielded 96.7% sensitivity (95% CI, 93.7-98.9). Prospective deployment screened 74,679 records, placed 63,858 (85.5%) in the AI-excluded pool and reduced estimated first-pass screening effort from 415 to 34 person-days. Across 600 randomly sampled AI-excluded records from the migraine and dementia programmes, none was confirmed as a final false negative after post-unblinding adjudication; across the completed 680-record audit, all 38 final retained records had been AI flagged, whereas locked blinded human consensus missed seven. These findings support locked, audited LLM triage, with human oversight and programme-specific monitoring, for systematic reviews used in national guidelines.
Etminan, M.; Rezaeianzadeh, R.; Douros, A.
Background: The rapid expansion of medical literature has led to substantial variability and frequent contradictions in study findings, making it increasingly difficult to distinguish meaningful signals from noise. Much of this variability arises from differences in study methodology, where biases such as confounding, selection bias, and reverse causation can drive spurious associations. While artificial intelligence (AI)-assisted tools have been developed to support risk-of-bias assessment, most are designed for systematic reviews and are not tailored to identifying specific epidemiologic biases in observational studies. This highlights the need for structured, scalable approaches to evaluate study validity in real-world evidence. Objective: To develop and validate an AI-assisted, expert-informed, rule-based framework (EpiVise) for systematically identifying and classifying key sources of bias in pharmacoepidemiologic studies, and to assess its agreement with expert evaluation. Methods: We conducted a validation study using recently published pharmacoepidemiologic studies from high-impact journals (post-2025). Each study was independently assessed by the framework and two expert epidemiologists, across predefined bias domains, including measured confounding, confounding by indication, selection bias, immortal time bias, and disease latency. Agreement was evaluated using weighted kappa statistics. In the absence of a gold standard, expert judgment served as the reference benchmark. In a second phase, synthetic study scenarios with predefined embedded biases were constructed to assess the framework's ability to detect known bias structures under controlled conditions.
Results: In analyses of published studies (10 studies; 60 ratings), agreement between the framework and expert assessments was substantial (κ = 0.75; 95% confidence interval [CI], 0.60-0.86), with 12 discordant ratings (20.0%), all limited to adjacent categories and occurring primarily in the confounding by indication and selection bias domains. In synthetic study scenarios (10 studies; 50 ratings), agreement was similarly substantial, with 42 of 50 ratings concordant (84%) and a weighted kappa of 0.77 (95% CI, 0.67-0.87); discordances included both adjacent-category and extreme disagreements and were concentrated in confounding by indication, selection bias, and prevalent user bias domains. Conclusions: This AI-assisted, expert-informed framework, EpiVise, provides a scalable and reproducible approach for evaluating epidemiologic study validity, demonstrating substantial agreement comparable to expert assessment. By systematically identifying key sources of bias, the framework has the potential to enhance the rigor and consistency of evidence evaluation, support peer review, and inform clinical, regulatory, and policy decision-making. Further validation across broader study designs and domains is warranted.
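To make the agreement statistic concrete, here is a minimal sketch of a linearly weighted kappa for two raters on an ordinal scale. The abstract does not state whether linear or quadratic weights were used, nor how many rating categories there were, so the weighting scheme, the three-level scale, and the ratings below are all hypothetical. The function name `linear_weighted_kappa` is likewise an illustration, not part of EpiVise.

```python
from collections import Counter


def linear_weighted_kappa(rater_a: list[int], rater_b: list[int], n_categories: int) -> float:
    """Weighted kappa with linear disagreement weights |i - j| / (k - 1).

    Ratings must be integers in range(n_categories).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(rater_a)
    max_dist = n_categories - 1

    # Observed mean disagreement across paired ratings.
    observed = sum(abs(a - b) for a, b in zip(rater_a, rater_b)) / (n * max_dist)

    # Disagreement expected by chance, from the two raters' marginal counts.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        count_a[i] * count_b[j] * abs(i - j)
        for i in range(n_categories)
        for j in range(n_categories)
    ) / (n * n * max_dist)

    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1 - observed / expected


# Hypothetical ratings (0 = low, 1 = moderate, 2 = high risk of bias).
framework = [0, 1, 2, 0, 1, 2]
expert = [0, 1, 2, 1, 1, 2]
kappa = linear_weighted_kappa(framework, expert, n_categories=3)
print(f"weighted kappa = {kappa:.2f}")
```

In this toy example one of six ratings differs, and only by one category, so the weighted statistic penalises it less than an unweighted kappa would.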
Fazeli, M. S.; Kasireddy, E.; Pourrahmat, M.-M.; Chow, C.; Collet, J. P.
Background: Systematic literature reviews (SLRs) are essential in medical research, but are often time-consuming and costly, necessitating more efficient methods while maintaining accuracy. Objective: This study assessed the performance of a GPT-4o mini large language model (LLM) in automating the first phase of study selection based on titles and abstracts in systematic reviews. Specifically, we evaluated whether the model improved efficiency without compromising on quality. Methods: Structured prompts were created for a GPT-4o mini LLM to facilitate title and abstract screening. The model's performance was evaluated against expert human reviewers across five systematic reviews on inclusion rates, sensitivity, specificity, accuracy, positive predictive value, and negative predictive value. Results: The model screened a total of 15,605 records. It included a higher percentage of studies than human screeners, with 3.5% (n=549/15,605) true positives and 14.2% (n=2,218/15,605) false positives. The model achieved an overall accuracy of 85.1%, with a sensitivity of 83.2% and specificity of 85.2%. The positive predictive value was 19.8%, while the negative predictive value was 99.1%. The model was able to screen 1,000 titles and abstracts in 40 minutes, compared to 16 hours required by a human reviewer. Conclusion: This study demonstrated a strong performance and efficiency in the automation of title and abstract screening in SLRs using an advanced LLM. Further refinements could optimize the balance between sensitivity and specificity, supporting broader implementation in evidence synthesis. A hybrid AI-human approach is recommended to ensure accuracy, reduce reviewer burden, and maintain the methodological rigor required for high-quality SLRs.
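The reported screening metrics can be cross-checked from a 2x2 confusion matrix. The sketch below uses the true-positive and false-positive counts given in the abstract. The false-negative count (111) is not reported; it is an assumption back-calculated here from the stated 83.2% sensitivity, and the true-negative count follows by subtraction.

```python
# Counts from the abstract.
total = 15_605
tp = 549      # included by both the model and human reviewers
fp = 2_218    # included by the model only

# ASSUMPTION: not reported in the abstract; inferred from sensitivity = 83.2%.
fn = 111
tn = total - tp - fp - fn

metrics = {
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "ppv": tp / (tp + fp),
    "npv": tn / (tn + fn),
    "accuracy": (tp + tn) / total,
}

for name, value in metrics.items():
    print(f"{name:>12}: {value:.1%}")
```

Under that assumed false-negative count, all five metrics round to the values reported in the abstract. The low positive predictive value alongside a high negative predictive value is what one would expect when eligible records are a small share of all records screened.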
Taherifard, E.; Mooghali, M.; Hakimian, H. R.; Mane, S. R.; Fu, M.; Bamford, S.; Berlin, J. A.; Childers, K.; Desai, N. R.; Gross, C. P.; Hewens, D.; Lehman, R.; Ritchie, J. D.; Sargood, T.; Waldstreicher, J.; Wallach, J. D.; Willeford, M. K.; Krumholz, H. M.; Ross, J. S.
Objective: To assess the number, timing of publication, characteristics, and scientific impact of secondary publications generated using individual participant-level data (IPD) from a portfolio of Johnson & Johnson-sponsored clinical trials shared with external investigators through a data sharing platform. Design: Cross-sectional study. Setting: Yale University Open Data Access (YODA) Project platform. Participants: Johnson & Johnson-sponsored clinical trials listed on the YODA Project platform with IPD available for external sharing as of December 31, 2021, and with a full-length, peer-reviewed publication (i.e., primary publication) reporting primary endpoint results by the original trial investigators. Main outcome measures: Number, timing of publication, research objectives, analysis type, and scientific impact of secondary publications using IPD from these trials identified through citation searches of primary publications in Web of Science through June 2025. Scientific impact metrics included journal impact factor, annual citation count, annual Altmetric Attention Score, and annual Mendeley reader count. Secondary publications were classified as internal (authored by at least one original trial investigator) or external. Results: Among 336 eligible trials, 265 (78.9%) had at least one associated secondary publication, totaling 1,167 secondary publications, of which 209 (17.9%) were external. Among external secondary publications for which the data access mechanism was reported (n=190; 90.9%), most obtained access through data sharing platforms (n=161; 84.7%), primarily the YODA Project (n=157; 82.6%). All secondary publications published from 3 years before through the first 2 years after the primary publication (n=161) were internal (100%). Over time, however, external publications increased steadily, exceeding 50% of all secondary publications by year 11 and thereafter.
External secondary publications were more frequently pooled analyses (151/209 [72.2%] vs 534/958 [55.7%]; P<0.001). Predictive or prognostic modelling (108/209 [51.7%] vs 322/958 [33.6%]; P<0.001), development of statistical models or algorithms (60/209 [28.7%] vs 114/958 [11.9%]; P<0.001), and validation of existing methods, models, or risk scores (32/209 [15.3%] vs 66/958 [6.9%]; P<0.001) were more frequent among external than internal secondary publications. Compared to internal secondary publications, external secondary publications were published in journals with higher impact factors (median, 6.7 [IQR, 3.4-16.6] vs 4.6 [2.9-10.2]; P=0.002) and had higher annual Altmetric Attention Scores (median, 2.1 [0.7-7.1] vs 0.6 [0.3-2.3]; P<0.001), but lower annual citation counts (median, 2.7 [1.1-5.6] vs 3.4 [1.6-7.5]; P<0.001) and were less likely to be cited in clinical guidelines (21/184 [11.4%] vs 235/805 [29.2%], P<0.001) or policy documents (14/184 [7.6%] vs 206/805 [25.6%], P<0.001); there was no difference in annual Mendeley reader counts (median, 7.4 [3.9-13.0] vs 8.0 [5.1-13.6], P=0.13). Conclusions: Clinical trial data shared with external investigators through a data sharing platform generated substantial and sustained secondary research by both original trial investigators and external investigators. The proportion of secondary publications from any clinical trial generated by external investigators increased over time as external investigators pursued complementary research objectives that achieved a comparable scientific impact. Structured data sharing mechanisms may further enhance the scientific impact of clinical trials.
What is already known on this topic: Sharing individual participant-level data (IPD) from clinical trials can promote transparency, reproducibility, and secondary research. Several initiatives, including the Yale University Open Data Access (YODA) Project and government-supported data sharing platforms, provide external investigators with access to clinical trial data. While prior evaluations of secondary research generated from shared clinical trial data suggest that external investigators' publications have citation impacts comparable to those of original trial investigators, overall evidence remains limited.
What this study adds: Analysis of 336 industry-sponsored clinical trials with IPD shared through the YODA Project showed that most generated secondary publications, by both original trial investigators and external investigators. The proportion of secondary publications from any clinical trial generated by external investigators increased over time, and compared with those generated by the original trial investigators, these publications more frequently use pooled analyses and focus on predictive or prognostic modelling and the development and validation of statistical methods. Secondary publications generated by external investigators were more often published in higher-impact journals and received higher Altmetric Attention Scores, but had lower annual citation counts and were less likely to be cited in clinical guidelines or policy documents than those generated by the original trial investigators.
MUTHUKA, J. K.; Zimunya, R.; Simengwa, A.; Onyango, C.; Oluoch, K. J.; Kioko, M. T.; Mbari, D. K. F.; Nzioki, J. M.; Chebungei, L. K.; Kim, S.; Nshimirimana, D. A.
This systematic review and meta-analysis aimed to estimate the overall effectiveness of ASD interventions and identify sources of heterogeneity using frequentist and Bayesian approaches. A systematic search of PubMed/MEDLINE, Embase, Web of Science, and Scopus was conducted for studies published between January 1, 2004, and April 30, 2025. Mainly, randomized controlled trial studies with extractable intervention outcomes were included. A total of 41 studies (n=3,008) were synthesized using random-effects models (REML), Bayesian hierarchical modeling, meta-regression, and sensitivity analyses following PRISMA guidelines. The pooled random-effects estimate showed a significant positive effect of ASD interventions (effect size = 0.506, 95% CI: 0.392-0.619; z = 8.72, p < .001), corresponding to an estimated success proportion of 62% (95% CI: 59%-65%). Heterogeneity was substantial (Q(40) = 238.78, p < .001; I² = 82.45%; τ² = 0.069, 95% CI: 0.028-0.137; τ = 0.262), with H² = 5.70 and a wide prediction interval (-0.020 to 1.031), indicating strong between-study variability. Bayesian meta-analysis confirmed a comparable effect (posterior mean = 0.619, 95% CrI: 0.592-0.646), with τ = 0.273 and I² ≈ 82.5%, and MCMC diagnostics showed stable convergence (R-hat ≈ 1.00). Publication bias analyses indicated significant funnel plot asymmetry (Egger-type regression: z = 3.429, p < .001; weighted regression: t = 9.573, p < .001), while rank correlation was non-significant (τ = -0.178, p = .103). Trim-and-fill analysis imputed 10 studies, reducing the pooled effect to 0.374 (95% CI: 0.258-0.491; τ = 0.338), though the effect remained significant (p < .001). Sensitivity analyses excluding influential studies yielded a stable effect (0.505, 95% CI: 0.401-0.609), with persistent heterogeneity (I² = 75.49%; Q(38) = 190.21, p < .001; τ² = 0.043).
Subgroup analyses showed highest effects for digital/technology-based interventions (0.672; 67%; I² = 0%), followed by nutritional (0.635; 64%; I² = 73.81%), behavioral (0.630; 63%; I² = 74.78%), and pharmacological (0.627; 63%; I² = 0%) interventions, while physical/occupational therapies showed lower effects (0.523; 52%; I² = 63.35%) and combined interventions showed borderline effects (0.593; 59%; I² = 19.96%); subgroup differences were significant (Q(5) = 22.63, p < .001). Regional effects were similar and non-significant (Q(2) = 0.73, p = .694): North America (0.619; I² = 84.21%), Europe (0.626; I² = 62.34%), and Asia (0.659; I² = 0%). Age at intervention onset did not significantly moderate effects (Q(5) = 0.98, p = .964), although variability was observed across children, adolescents, adults, and toddlers. Meta-regression identified significant moderators including intervention context (Q = 18.159, p = .020), outcome domain (Q = 19.588, p = .003), age at start (Q = 17.795, p = .003), and intervention category (Q = 31.714, p < .001), while follow-up duration and intervention duration were not significant. Bayesian subgroup analyses confirmed robustness, with strongest evidence for pharmacological (BF → ∞), behavioral (BF ≈ 832.50), and digital interventions (BF ≈ 30.92). In conclusion, ASD interventions demonstrate a moderate and statistically significant overall effect (~0.50-0.62), with substantial heterogeneity driven primarily by intervention type, context, and participant characteristics, and findings were consistent across frequentist, Bayesian, and sensitivity analyses, supporting robust but context-dependent effectiveness.
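Two of the overall summary statistics in this abstract can be related to each other with simple arithmetic. The sketch below assumes the common definitions in which I² and H² are both derived from the estimated between-study variance, so that I² = (H² - 1) / H², and it recovers an approximate z statistic from the pooled estimate and its 95% confidence interval, assuming a normal-based interval. Both are assumptions about how the figures were produced, not something the abstract states, and the function names are hypothetical.

```python
def i_squared_from_h_squared(h_squared: float) -> float:
    """I² (as a percentage) implied by H², using I² = (H² - 1) / H²."""
    if h_squared < 1:
        return 0.0  # by convention I² is truncated at zero
    return 100 * (h_squared - 1) / h_squared


def z_from_ci(estimate: float, lower: float, upper: float, z_crit: float = 1.96) -> float:
    """Approximate z statistic, assuming a symmetric normal-based 95% CI."""
    standard_error = (upper - lower) / (2 * z_crit)
    return estimate / standard_error


# Values quoted in the abstract for the overall random-effects model.
i2 = i_squared_from_h_squared(5.70)
z = z_from_ci(0.506, 0.392, 0.619)
print(f"I² implied by H² = 5.70: {i2:.2f}%")
print(f"z implied by the CI: {z:.2f}")
```

Both implied values agree with the reported I² = 82.45% and z = 8.72 up to rounding of the published inputs.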
Fulbright, H. A.; Morrison, K.
Background: For evidence syntheses using English language limits, several different methods and approaches are available. Objective: To understand the English language (EL) limits available on Ovid MEDLINE and Embase and the application of language metadata on these databases. To compare the impact of five EL limits versus removing non-English language (NEL) records during screening. Methods: Using the records included at full text screening or excluded on NEL status during screening in seven evidence syntheses, we tested five EL limits on 1,509 MEDLINE and 1,584 Embase records. 'Includes' removed or 'NEL excludes' retrieved were investigated. Results: All EL limits performed identically: 99.8% of MEDLINE 'includes' were retrieved versus 99.7% on Embase. All five 'includes' incorrectly removed with EL limits had language metadata errors. Although 98.2% of MEDLINE and 94.6% of Embase 'NEL excludes' were removed with EL limits, eight MEDLINE and nine Embase records were available in English. Discussion: The risk of excluding potentially eligible records due to language restrictions (whether applied during the strategies or screening) could be mitigated with forward and backward citation searching. Conclusion: EL limits risk removing records with incorrect language metadata. However, EL records might also be excluded on language during screening.
Gartlehner, G.; Banda, S.; Callaghan, M.; Chase, J.-A.; Dobrescu, A.; Eisele-Metzger, A.; Flemyng, E.; Gardner, S.; Griebler, U.; Helfer, B.; Jemiolo, P.; Macura, B.; Minx, J. C.; Noel-Storr, A.; Rajabzadeh Tahmasebi, N.; Sharifan, A.; Meerpohl, J.; Thomas, J.
Background: Artificial intelligence (AI) has the potential to improve the efficiency of evidence synthesis and reduce human error. However, robust methods for evaluating rapidly evolving AI tools within the practical workflows of evidence synthesis remain underdeveloped. This protocol describes a study design for assessing the effectiveness, efficiency, and usability of AI tools in comparison to traditional human-only workflows in the context of Cochrane systematic reviews. Methods: Members of the Cochrane Evaluation of (Semi-) Automated Review (CESAR) Methods Project developed an adaptive platform study-within-a-review (SWAR) design, modeled after clinical platform trials. This design employs a master protocol to concurrently evaluate multiple AI tools (interventions) against a standard human-only process (control) across three key review tasks: title and abstract screening, full-text screening, and data extraction. The adaptive framework allows for the addition or removal of AI tools based on interim performance analyses without necessitating a restart of the study. Performance will be assessed using metrics such as accuracy (sensitivity, specificity, precision), efficiency (time on task), response stability, impact of errors, and usability, in alignment with Responsible use of AI in evidence SynthEsis (RAISE) principles. Results: The study will generate comparative data about the performance and usability of specific AI tools employed in a semi- or fully automated manner relative to standard human effort. The protocol provides a flexible framework for the assessment of AI tools in evidence synthesis, addressing the limitations of static, one-time evaluations. Discussion: This study protocol presents a novel methodological approach to addressing the challenges of evaluating AI tools for evidence syntheses.
By validating entire workflows rather than individual technologies, the findings will establish an evidence base for determining the viability of integrating AI into evidence-synthesis workflows. The adaptive design of this study is flexible and can be adopted by other investigators, ensuring that the evaluation framework remains relevant as new tools emerge.
Jafari, H.; Chu, P.; Lange, M.; Maher, F.; Glen, C.; Pearson, O. J.; Burges, C.; Martyn, M.; Cross, S.; Carter, B.; Emsley, R.; Forbes, G.
Background: Statistical Analysis Plans (SAPs) are essential for trial transparency and credibility but are resource-intensive to produce. While Large Language Models (LLMs) have shown promise in drafting protocols, their ability to generate high-quality, protocol-compliant SAPs remains untested against current content guidance. This study developed and validated an LLM-based pipeline for drafting SAPs from clinical trial protocols. Methods: We developed a structured, section-by-section prompting pipeline aligned with standard SAP guidance. We applied this pipeline to nine clinical trial protocols using three leading LLMs: OpenAI GPT-5, Anthropic Claude Sonnet 4, and Google Gemini 2.5 Pro. The resulting 27 SAPs were evaluated against a 46-item quality checklist derived from the published SAP guidelines. Items were double-scored by independent trial statisticians on a 0 to 3 scale for accuracy. We compared performance across LLMs and between item types (descriptive vs. statistical reasoning) using mixed-effects logistic regression. Results: Across 9 trials, the models produced SAP drafts with high overall accuracy (77% to 78%), with no difference in performance between the three LLMs (p=0.79), although accuracy varied by content type (p < 0.001). All models performed well on descriptive items (e.g., administrative details, trial design), with lower accuracy for items requiring statistical reasoning (e.g., modelling strategies, sensitivity analyses). Accuracy for statistical items ranged from 67% to 72%, whereas descriptive items achieved 81% to 83% accuracy. Qualitatively, models were prone to specific failure modes in complex sections, such as omitting necessary details for secondary outcome models or hallucinating sensitivity analyses. Discussion: Current LLMs can effectively draft portions of SAPs, offering the potential for substantial time savings in trial documentation.
However, a human-in-the-loop approach remains mandatory; while models demonstrate strong capability in producing descriptive content, their independent application to complex statistical methodology design still requires further methodological development and training. Future work should explore advanced prompt engineering, such as retrieval-augmented generation or agentic workflows, to improve reasoning capabilities.
Panagiotopoulos, A.-P.; Laskaris, A.; Tsakri, D.; Manoussopoulos, Y.; Anastassopoulou, C.; Tsakris, A.; Ioannidis, J.
Objectives: To quantify the frequency of baseline control-group use in published long COVID prevalence studies and assess their key methodological features. Design: Cross-sectional meta-epidemiological evaluation of published post-acute COVID-19 prevalence studies, supplemented by a corresponding-author survey. Setting: Published studies identified through a systematic review by Hou et al. (2025) and supplementary data obtained through direct email contact with corresponding authors. Participants: A total of 440 published long COVID prevalence studies. Main outcome measures: Presence and type of comparator group, reliance on solely self-reported outcomes, acknowledgment of lack of a control group among uncontrolled studies, and availability of additional comparator data through author survey. Results: Among 440 studies, 372 (84.5%) reported no control group in their publication. Healthy or uninfected comparators were reported in 55 studies (12.5%) and other comparator types in 14 (3.2%); 1 study included both categories. Solely self-reported outcomes were used in 279 studies (63.4%). Among 372 uncontrolled studies, 244 (65.6%) did not explicitly acknowledge the absence of a baseline comparator as a limitation anywhere in the text. Corresponding authors of 140 studies (31.8%) responded to the survey; among them, 126 (90.0%) reported no additional comparative data, while 14 (10.0%) mentioned some available comparative datasets (19 additional datasets). Almost all of that information (10/14, 17/19) had already been published in other articles not captured by the Hou et al. systematic review. Conclusions: Most published long COVID prevalence studies lacked comparator groups and relied exclusively on self-reported outcomes without acknowledging this limitation. Direct author contact identified little additional comparator information. Much of the long COVID prevalence literature may therefore be poorly suited to estimating burden attributable specifically to SARS-CoV-2, underscoring the need for appropriately matched comparators and more objective outcome assessment. Registration: The protocol was prospectively registered on the Open Science Framework (https://osf.io/f4hra).
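The percentages in the long COVID abstract above can be re-derived from the counts it reports. The following Python sketch is only an arithmetic check, not part of the study: the helper `pct` and the constant names are hypothetical, and the counts are copied from the abstract. It also confirms that the comparator categories sum to 440 once the single study counted in both categories is subtracted.

```python
def pct(numerator: int, denominator: int) -> str:
    """Format a proportion as a percentage with one decimal place."""
    return f"{100 * numerator / denominator:.1f}%"


# Counts as reported in the abstract (names are illustrative).
TOTAL = 440          # studies assessed
UNCONTROLLED = 372   # no control group reported
HEALTHY = 55         # healthy or uninfected comparators
OTHER = 14           # other comparator types
BOTH = 1             # study counted in both comparator categories
SELF_REPORT = 279    # solely self-reported outcomes
NOT_ACKNOWLEDGED = 244
RESPONDED = 140
NO_EXTRA_DATA = 126

# The categories overlap by one study, so subtract it once.
categories_total = UNCONTROLLED + HEALTHY + OTHER - BOTH

summary = {
    "no control group": pct(UNCONTROLLED, TOTAL),
    "healthy/uninfected comparators": pct(HEALTHY, TOTAL),
    "other comparators": pct(OTHER, TOTAL),
    "solely self-reported": pct(SELF_REPORT, TOTAL),
    "limitation not acknowledged": pct(NOT_ACKNOWLEDGED, UNCONTROLLED),
    "survey response": pct(RESPONDED, TOTAL),
    "no additional data": pct(NO_EXTRA_DATA, RESPONDED),
}

for label, value in summary.items():
    print(f"{label}: {value}")
print("categories sum to:", categories_total)
```

Note that the denominator changes between rows: the acknowledgment figure is out of the 372 uncontrolled studies and the survey figure is out of the 140 responders, not out of all 440.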
Chenggong, X.; Weichang, K.; Liuting, P.; Diaoxin, Q.; Yuxuan, Y.; Bin, W.; Liang, H.
Objective: To systematically evaluate the diagnostic performance of large language models (LLMs) in automated medical literature screening and to determine their potential role in supporting evidence synthesis workflows. Methods: A systematic review and meta-analysis was conducted according to PRISMA-DTA guidance. PubMed, Web of Science, Embase, the Cochrane Library and Google Scholar were searched from 1 January 2022 to 17 November 2025. Studies assessing LLMs for automated title and abstract screening or full-text eligibility assessment in medical literature were included. Diagnostic accuracy metrics were extracted and pooled using a bivariate random-effects model and hierarchical summary receiver operating characteristic (HSROC) analysis. Subgroup analyses and meta-regression were performed to explore sources of heterogeneity. Results: Eighteen studies published between 2023 and 2025 were included. In title and abstract screening, the pooled sensitivity was 0.92 and pooled specificity was 0.94. The SROC area under the curve (AUC) reached 0.98. In full-text screening, pooled sensitivity and specificity both reached 0.99 and the AUC was 0.99. Prompt strategies incorporating examples or chain-of-thought reasoning significantly improved sensitivity. Across studies, most models were deployed without task-specific fine-tuning and still achieved strong performance. Subgroup analyses and meta-regression did not identify significant sources of heterogeneity. Many studies also reported substantial efficiency gains, including large reductions in screening workload, time and cost. Conclusion: LLMs demonstrate high diagnostic accuracy for automated medical literature screening, particularly in full-text assessment. These models show strong potential as high-sensitivity assistive tools that can substantially reduce manual screening burden while supporting evidence synthesis. Further methodological optimization and validation in large-scale real-world settings are required to establish their long-term role in evidence-based medicine.
Markozannes, G.; Jayedi, A.; Cividini, S.; Kazmi, S. Z.; Cariolou, M.; Vieira, R.; Pagkalidou, E.; Kiss, S.; Balducci, K.; Aune, D.; Gunter, M. J.; Cross, A. J.; Chan, D. S. M.; Tsilidis, K. K.
Background: The 2018 World Cancer Research Fund (WCRF)/American Institute for Cancer Research Third Expert Report (TER) on diet, adiposity, physical activity and risk of 19 cancers could be enhanced with new data. A framework is needed to prioritize future systematic reviews. Methods: We searched PubMed (January 2019-February 2024) for meta-analyses, pooled analyses, randomized controlled trials (RCTs), Mendelian randomization (MR) studies, and large (>100,000 participants) cohort studies. We assessed TER findings using conditional power (CP) and fail-safe number (FSN) statistics. We developed an exposure-based prioritization score (PS) by awarding or subtracting points considering the quantity, statistical significance, direction, and novelty of associations. Results: We compared 366 meta-analyses, 121 pooled analyses, 19 RCTs, 174 MR studies, and 391 cohort studies covering 151 exposures and 28 cancers with 1,371 TER meta-analyses. Based on CP, non-significant TER associations likely to become significant with additional evidence included folate and colorectal, waist circumference and lung, total fat and ovarian, tea and ovarian, and red meat and kidney cancers. The FSN indicated that most significant TER associations are unlikely to change with additional evidence. The median PS was 6 (range: -15 to 163), with top scores observed for anthropometric measurements (PS_height=40 to PS_BMI=163), physical activity (PS=100), sedentary behavior (PS=64), alcohol (PS=52), tea (PS=36), dietary fiber (PS=31), milk/dairy (PS=29), micronutrients (PS_retinol=27 to PS_iron=38), vitamins (PS_B12=22 to PS_vitD=91), soy (PS=24), isoflavones (PS=23), and sugar-sweetened beverages (PS=22). Conclusions and Impact: The prioritization framework can help identify impactful systematic reviews to complement TER conclusions and enhance our understanding of emerging research.
Barreto, G. H. C.; Burke, C.; Davies, P.; Halicka, M.; Paterson, C.; Swinton, P.; Saunders, B.; Higgins, J. P. T.
Background: Systematic reviews are essential for evidence-based decision making in health sciences but require substantial time and resources for manual processes, particularly title and abstract screening. Recent advances in machine learning and large language models (LLMs) have demonstrated promise in accelerating screening with high recall but are often limited by modest gains in efficiency, mostly due to the absence of a generalisable stopping criterion. Here, we introduce and report preliminary findings on the performance of a novel semi-automated active learning system, JARVIS, that integrates LLM-based reasoning using the PICOS framework, neural network-based classification, and human decision-making to facilitate abstract screening. Methods: Datasets containing author-made inclusion and exclusion decisions from six published systematic reviews were used to pilot the semi-automated screening system. Model performance was evaluated across recall, specificity and area under the precision-recall curve (AUC-PR), using full-text inclusion as the ground truth. Estimated workload and financial savings were calculated by comparing total screening time and reviewer costs across manual and semi-automated scenarios. Results: Across the six review datasets, recall ranged between 98.2% and 100%, and specificity ranged between 97.9% and 99.2% at the defined stopping point. Across iterations, AUC-PR values ranged between 83.8% and 100%. Compared with human-only screening, JARVIS delivered workload savings between 71.0% and 93.6%. When a single reviewer read the excluded records, workload savings ranged between 35.6% and 46.8%. Conclusion: The proposed semi-automated system substantially reduced reviewer workload while maintaining high recall, improving on previously reported approaches. Further validation in larger and more varied reviews, as well as prospective testing, is warranted.
Bannett, Y.; Pillai, M.; Huang, T.; Luo, I.; Gunturkun, F.; Hernandez-Boussard, T.
Importance: Guideline-concordant care for young children with attention-deficit/hyperactivity disorder (ADHD) includes recommending parent training in behavior management (PTBM) as first-line treatment. However, assessing guideline adherence through manual chart review is time-consuming and costly, limiting scalable and timely quality-of-care measurement. Objective: To evaluate the accuracy and explainability of large language models (LLMs) in identifying PTBM recommendations in pediatric electronic health record (EHR) notes as a scalable alternative to manual chart review. Design, Setting, and Participants: This retrospective cohort study was conducted in a community-based pediatric healthcare network in California consisting of 27 primary care clinics. The study cohort included children aged 4-6 years with ≥2 primary care visits between 2020 and 2024 and ICD-10 diagnoses of ADHD or ADHD symptoms (n=542 patients). Clinical notes from the first ADHD-related visit were included. A stratified subset of 122 notes, including all cases with model disagreement, was manually annotated to assess model performance in identifying PTBM recommendations and rank model explanations. Exposures: Assessment and plan sections of clinical notes were analyzed using three generative large language models (Claude-3.5, GPT-4o, and LLaMA-3.3-70B) to identify the presence of PTBM recommendations and generate explanatory rationales and documentation evidence. Main Outcomes and Measures: Model performance in identifying PTBM recommendations (measured by sensitivity, positive predictive value (PPV), and F1-score) and qualitative explainability ratings of model-generated rationales (based on the QUEST framework). Results: All three models demonstrated high performance compared to expert chart review. Claude-3.5 showed balanced performance (sensitivity=0.89, PPV=0.95, and F1-score=0.92) and ranked highest in explainability. LLaMA-3.3-70B achieved sensitivity=0.91, PPV=0.89, and F1-score=0.90, ranking second for explainability. GPT-4o had the highest PPV (0.97) but lowest sensitivity (0.82), with an F1-score of 0.89 and the lowest explainability ranking. Based on classifications from the best-performing model, Claude-3.5, 26.4% (143/542) of patients had documented PTBM recommendations at their first ADHD-related visit. Conclusions and Relevance: LLMs can accurately extract guideline-concordant clinician recommendations for non-pharmacological ADHD treatment from unstructured clinical notes while providing clear explanations and supporting evidence. Evaluating model explainability as part of LLM implementation for medical chart review tasks can promote transparent and scalable solutions for quality-of-care measurement.
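The F1-scores in the ADHD abstract above follow from the reported sensitivity and PPV, since F1 is the harmonic mean of the two (sensitivity is recall, PPV is precision). The Python sketch below is only a consistency check under that standard definition; the function name `f1_score` and the dictionary layout are illustrative, and because the inputs are already rounded to two decimals, agreement with the published values is approximate rather than guaranteed.

```python
def f1_score(sensitivity: float, ppv: float) -> float:
    """Harmonic mean of sensitivity (recall) and PPV (precision)."""
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# (sensitivity, PPV) pairs as reported in the abstract.
reported = {
    "Claude-3.5": (0.89, 0.95),
    "LLaMA-3.3-70B": (0.91, 0.89),
    "GPT-4o": (0.82, 0.97),
}

# Recompute F1 for each model and round to two decimals, as in the abstract.
f1_by_model = {
    name: round(f1_score(sens, ppv), 2) for name, (sens, ppv) in reported.items()
}

for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1:.2f}")
```

With these inputs the recomputed values match the abstract's 0.92, 0.90, and 0.89. The GPT-4o row shows how a high PPV cannot fully offset lower sensitivity, because the harmonic mean is pulled toward the smaller of the two values.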
Forbes, C.; Carter, M.; Hudson, C.; Glasziou, P.; Clark, J.
Systematic Reviews (SRs) are the gold standard for evidence synthesis, but the manual title and abstract screening of thousands of references creates a severe bottleneck. Existing automated tools have historically struggled to achieve the near-perfect recall (sensitivity) required for reliable reviews. We developed MechaScreener as a "zero-shot" automated screening tool that utilises a Large Language Model (LLM) to rank article relevance. The tool requires no initial training data or manual pre-screening, as MechaScreener directly applies user-provided question elements (PICO) or inclusion/exclusion criteria to assign an inclusion probability score (1-5) to each reference. We evaluated the tool in two phases: a development phase using five reference libraries to optimise prompts, and an independent evaluation phase using 10 diverse Cochrane review libraries (comprising both randomised controlled trials and non-RCTs) containing over 58,000 references. In the evaluation dataset, MechaScreener achieved a perfect mean recall of 1.00 (100%, pooled 95% CI: 0.98-1.00), ensuring no relevant articles were missed. Concurrently, it achieved an overall mean specificity of 0.61 (61%, pooled 95% CI: 0.59-0.60). Specificity varied from 0.21 in broad public health topics to 0.91 in precise pharmacological interventions, reflecting the tool's built-in conservatism when evaluating ambiguous abstracts. By safely eliminating over 60% of irrelevant literature during the initial screening phase without compromising recall, MechaScreener functions as a highly reliable but low-effort "first-pass" filter, allowing researchers to substantially reduce manual workloads and reallocate resources toward full-text review and data extraction.
Tai, K. H.; Varvara, G.; Escoffier, E.; Mansmann, U.; DeVito, N. J.; Vieira Armond, A. C.; Naudet, F.
Objective: To map the presence, public availability, and content of clinical trial data sharing policies (DSP), data management and sharing plans (DMSP), and data use agreements (DUA) among the most prolific public and private clinical trial sponsors operating in the European Union, and to identify key areas of convergence, divergence, and constraint in the context of the General Data Protection Regulation (GDPR). Eligibility criteria: We included organisation-level documents describing approaches to clinical trial data sharing or data management from the top 20 public and top 20 private sponsors ranked by the number of trials registered in the EU Clinical Trials Information System (CTIS). Eligible materials comprised publicly available or sponsor-shared policies, guidelines, statements, templates, and agreements relevant to clinical trial data sharing or management. Sources of evidence: Evidence was identified through systematic searches of sponsors' public websites, structured Google searches, and major data management plan platforms (DMPTool, DMPonline, DMP Assistant), complemented by direct contact with sponsors to verify findings and request missing documentation. All sources were archived and catalogued. Charting methods: Two reviewers independently extracted data using a structured form, capturing the existence, accessibility, and content of data sharing policies, data management and sharing plans, and data use agreements. Quantitative data were summarised descriptively, and a non-interpretive descriptive content analysis was conducted to characterise recurring policy elements and areas of heterogeneity. Results: Among 40 sponsors, private sponsors were substantially more likely than public sponsors to make trial-specific data sharing policies and data use agreements publicly accessible, often via established data sharing platforms. Public sponsors more frequently referenced data management and sharing plans, but these were heterogeneous in scope and often embedded within broader institutional governance documents rather than tailored to clinical trials. Across sectors, GDPR compliance, data protection, and legal safeguards were emphasised, while operational aspects such as dataset readiness, review criteria, and downstream responsibilities varied widely. The overall response rate to sponsor verification was 37.5%. Conclusion: Clinical trial data sharing governance in the EU shows a marked sectoral imbalance among the top sponsors. Private sponsors tend to provide more detailed and operationally explicit documentation, whereas public sponsors often articulate high-level commitments without trial-specific guidance. Greater clarity and standardisation, particularly among public sponsors, could improve transparency and facilitate responsible data reuse, while remaining compatible with GDPR requirements.